fix: GBx00 misclassified as DPU during PXE boot #2932

Merged
krish-nvidia merged 2 commits into NVIDIA:main from krish-nvidia:gbx00-pxe-classification
Jun 26, 2026

@krish-nvidia

Contributor

Problem

On ARM hosts that do not yet have a minted machine ID, the PXE boot path classified a machine as a DPU whenever its explored endpoint had any BlueField part number in its chassis inventory (has_bluefield_part_number()). The GBx00 host BMC reports its BlueField-3 as a chassis object, so the host was misclassified as a DPU and served the DPU OS (carbide.efi / carbide.root) instead of the host image (scout.efi).

Fix

In crates/api-core/src/ipxe.rs, we now treat an endpoint as a DPU only when the explored endpoint is itself a DPU BMC (endpoint.report.is_dpu(), i.e. Systems[0].Id == "Bluefield") rather than merely containing a BlueField part number.
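
To illustrate the difference between the two predicates, here is a minimal sketch. It is not the code in crates/api-core/src/ipxe.rs: the types `MachineType`, `ChassisItem`, `Report`, and `ExploredEndpoint`, the `is_bluefield_part` flag, the helper functions, and the host system ID `"System_0"` are all hypothetical stand-ins. Only the names `is_dpu()` and `has_bluefield_part_number()` and the `"Bluefield"` system ID come from this description.

```rust
// Simplified, hypothetical stand-ins for the real types in crates/api-core.
// Only `is_dpu`, `has_bluefield_part_number`, and the "Bluefield" system ID
// come from the PR description; everything else is assumed for illustration.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MachineType {
    Dpu,
    Host,
}

pub struct ChassisItem {
    // Assumed flag; the real code inspects part-number strings.
    pub is_bluefield_part: bool,
}

pub struct Report {
    pub system_ids: Vec<String>,
    pub chassis: Vec<ChassisItem>,
}

impl Report {
    /// True only when the endpoint is itself a DPU BMC
    /// (Systems[0].Id == "Bluefield").
    pub fn is_dpu(&self) -> bool {
        self.system_ids
            .first()
            .map_or(false, |id| id.as_str() == "Bluefield")
    }
}

pub struct ExploredEndpoint {
    pub report: Report,
}

impl ExploredEndpoint {
    /// Old predicate: true when any chassis object is a BlueField part.
    pub fn has_bluefield_part_number(&self) -> bool {
        self.report.chassis.iter().any(|c| c.is_bluefield_part)
    }
}

/// Classification before the fix: inventory contents decide.
pub fn classify_before(endpoint: &ExploredEndpoint) -> MachineType {
    if endpoint.has_bluefield_part_number() {
        MachineType::Dpu
    } else {
        MachineType::Host
    }
}

/// Classification after the fix: only the endpoint's own identity decides.
pub fn classify_after(endpoint: &ExploredEndpoint) -> MachineType {
    if endpoint.report.is_dpu() {
        MachineType::Dpu
    } else {
        MachineType::Host
    }
}

/// A GBx00-like host BMC: non-DPU system ID (value assumed) with a
/// BlueField-3 listed as a chassis object.
pub fn gbx00_host_bmc() -> ExploredEndpoint {
    ExploredEndpoint {
        report: Report {
            system_ids: vec!["System_0".to_string()],
            chassis: vec![ChassisItem { is_bluefield_part: true }],
        },
    }
}

/// A DPU BMC: the first system ID is "Bluefield".
pub fn dpu_bmc() -> ExploredEndpoint {
    ExploredEndpoint {
        report: Report {
            system_ids: vec!["Bluefield".to_string()],
            chassis: vec![ChassisItem { is_bluefield_part: true }],
        },
    }
}
```

Under these assumptions, `classify_before(&gbx00_host_bmc())` yields `MachineType::Dpu` (the bug), while `classify_after` yields `MachineType::Host` for the same endpoint and still yields `MachineType::Dpu` for `dpu_bmc()`.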

Related issues

#2930

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia self-assigned this Jun 26, 2026
@krish-nvidia krish-nvidia requested a review from a team as a code owner June 26, 2026 18:59
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

Summary by CodeRabbit

  • Bug Fixes
    • Improved ARM boot instruction selection for endpoints with unknown interfaces or missing machine IDs.
    • DPU devices are now identified more accurately, helping them receive the correct boot instructions.

Walkthrough

The ARM fallback in PxeInstructions::get_pxe_instructions now uses endpoint.report.is_dpu() to select DPU-specific iPXE instructions when the interface is unknown or the minted machine ID is missing.

Changes

ARM fallback classification

| Layer / File(s) | Summary |
|---|---|
| DPU fallback check: crates/api-core/src/ipxe.rs | The ARM fallback comment and predicate now use endpoint.report.is_dpu() instead of endpoint.has_bluefield_part_number() to choose the DPU MachineType path. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2670: This PR changes how is_dpu is derived, which feeds the classification used by the ARM fallback here.
  • NVIDIA/infra-controller#2909: This PR touched the same ARM fallback branch in crates/api-core/src/ipxe.rs and is directly connected to the same decision path.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main fix: GBx00 hosts were being misclassified as DPUs during PXE boot. |
| Description check | ✅ Passed | The description is directly aligned with the code change and clearly explains the bug, root cause, and fix. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/api-core/src/ipxe.rs (1)

369-386: 🎯 Functional Correctness | 🟠 Major

Add a regression test for the ARM non-DPU fallback.

endpoint.report.is_dpu() is the right predicate here, but there is still no test for the exact case this fixes: ARM PXE with a BlueField part number in inventory where the endpoint is not a DPU BMC. The current coverage exercises the DPU path via target.product, so this branch can still regress silently. Add a case that asserts aarch64/scout.efi for that scenario.
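
A regression test along these lines could take roughly the following shape. This is only a sketch under stated assumptions: `FakeEndpoint` and `arm_fallback_boot_file` are hypothetical reductions of the real endpoint type and the ARM fallback branch, and the DPU path string `aarch64/carbide.efi` is assumed; only `aarch64/scout.efi` is named in this review.

```rust
// Hypothetical fixture; a real test would build an explored endpoint
// through the project's own test helpers.
pub struct FakeEndpoint {
    pub is_dpu: bool,
    pub has_bluefield_part_number: bool,
}

/// Assumed reduction of the ARM fallback after the fix: only `is_dpu`
/// influences the choice. The DPU path string is an assumption; only
/// "aarch64/scout.efi" appears in the review comment.
pub fn arm_fallback_boot_file(endpoint: &FakeEndpoint) -> &'static str {
    if endpoint.is_dpu {
        "aarch64/carbide.efi"
    } else {
        "aarch64/scout.efi"
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // The case this PR fixes: BlueField part number in inventory,
    // but the endpoint is not a DPU BMC.
    #[test]
    fn arm_host_with_bluefield_in_inventory_gets_scout() {
        let host = FakeEndpoint {
            is_dpu: false,
            has_bluefield_part_number: true,
        };
        assert_eq!(arm_fallback_boot_file(&host), "aarch64/scout.efi");
    }

    // The existing DPU path must keep working.
    #[test]
    fn dpu_bmc_does_not_get_host_image() {
        let dpu = FakeEndpoint {
            is_dpu: true,
            has_bluefield_part_number: true,
        };
        assert_ne!(arm_fallback_boot_file(&dpu), "aarch64/scout.efi");
    }
}
```

Keeping `has_bluefield_part_number` set to `true` in both cases is deliberate: it shows that the inventory flag no longer changes the outcome.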

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/ipxe.rs` around lines 369 - 386, Add a regression test
covering the ARM non-DPU fallback in the PXE logic so this branch cannot regress
silently. Update the test coverage around
`PxeInstructions::get_pxe_instruction_for_arch` in `ipxe.rs` to use an ARM
target with a BlueField part number in inventory while
`endpoint.report.is_dpu()` is false, and assert it selects the ARM fallback
output `aarch64/scout.efi`. Keep the existing DPU-path coverage intact and add
this as the distinct non-DPU case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/api-core/src/ipxe.rs`:
- Around line 369-386: Add a regression test covering the ARM non-DPU fallback
in the PXE logic so this branch cannot regress silently. Update the test
coverage around `PxeInstructions::get_pxe_instruction_for_arch` in `ipxe.rs` to
use an ARM target with a BlueField part number in inventory while
`endpoint.report.is_dpu()` is false, and assert it selects the ARM fallback
output `aarch64/scout.efi`. Keep the existing DPU-path coverage intact and add
this as the distinct non-DPU case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eae6d5fa-c383-4cc7-bee6-27d90374c889

📥 Commits

Reviewing files that changed from the base of the PR and between b4b2622 and 5950bbd.

📒 Files selected for processing (1)
  • crates/api-core/src/ipxe.rs

@github-actions

🔍 Container Scan Summary

| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 285 | 6 | 26 | 102 | 7 | 144 |
| machine-validation-runner | 744 | 32 | 188 | 267 | 36 | 221 |
| machine_validation | 744 | 32 | 188 | 267 | 36 | 221 |
| machine_validation-aarch64 | 744 | 32 | 188 | 267 | 36 | 221 |
| nvmetal-carbide | 744 | 32 | 188 | 267 | 36 | 221 |
| TOTAL | 3267 | 134 | 778 | 1176 | 151 | 1028 |

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@krish-nvidia krish-nvidia merged commit 3e03d27 into NVIDIA:main Jun 26, 2026
58 checks passed